Unstructured data to structured data conversion via EXTRACT_COLUMN #1338

Open
wants to merge 10 commits into
base: staging
hershd23
Contributor

@hershd23 hershd23 commented Nov 3, 2023

Added custom function for extracting columns from unstructured data
new file: ../evadb/functions/extract_columns.py

	new file:   ../evadb/functions/extract_columns.py
@hershd23 hershd23 requested review from xzdandy and gaurav274 November 3, 2023 08:26
@hershd23 hershd23 marked this pull request as draft November 3, 2023 08:26
@hershd23
Contributor Author

hershd23 commented Nov 3, 2023

@xzdandy I created a Python notebook as well, but it gets gitignored while the rest of the tutorial notebooks don't. Any idea why?

	new file:   20-structured-data.ipynb
@hershd23 hershd23 changed the title [WIP] Unstructured data to structured data conversion [WIP] Unstructured data to structured data conversion via EXTRACT_COLUMNS Nov 3, 2023
@hershd23
Contributor Author

hershd23 commented Nov 3, 2023

Solves #1235

@hershd23 hershd23 requested a review from pchunduri6 November 3, 2023 08:36
@xzdandy xzdandy added High Effort 🏋 Difficult solution or problem to solve AI Engines Features, Bugs, related to AI Engines labels Nov 3, 2023
@xzdandy xzdandy linked an issue Nov 3, 2023 that may be closed by this pull request
2 tasks
@hershd23
Contributor Author

hershd23 commented Nov 28, 2023

@xzdandy @pchunduri6 moved to a "one-column-at-a-time" implementation as you recommended.

The notebook has the implementation

	modified:   .gitignore
Added file to extract one column at a time
	new file:   evadb/functions/extract_column.py
Removed the previous implementation
	deleted:    evadb/functions/extract_columns.py
Updated the notebook
	modified:   tutorials/20-structured-data.ipynb
@hershd23 hershd23 marked this pull request as ready for review November 29, 2023 04:58
@hershd23
Contributor Author

For one column at a time I think this PR is ready for review @xzdandy @pchunduri6.

For the other changes discussed with either of you, I think it makes sense to take those up in a separate PR; otherwise this one will bloat. Let me know what you think.

@hershd23 hershd23 changed the title [WIP] Unstructured data to structured data conversion via EXTRACT_COLUMNS Unstructured data to structured data conversion via EXTRACT_COLUMNS Nov 29, 2023
@hershd23 hershd23 changed the title Unstructured data to structured data conversion via EXTRACT_COLUMNS Unstructured data to structured data conversion via EXTRACT_COLUMN Nov 29, 2023
@xzdandy
Collaborator

xzdandy commented Nov 29, 2023

Can we also add a long integration test for the function under https://github.com/georgia-tech-db/evadb/tree/staging/test/integration_tests/long/functions? We can skip the test in CircleCI because it needs an OpenAI key, but I think it is good to have one.

It can either be end-to-end (i.e., SQL queries) or directly test the function class.
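
A minimal sketch of what a directly tested, skippable case could look like. `fake_extract_column` is a hypothetical stand-in (not the code in this PR) so the snippet runs without EvaDB; a real test would call the function from `evadb/functions/extract_column.py` instead. Reading the key from the `OPENAI_API_KEY` environment variable is also an assumption.

```python
import os
import unittest


def fake_extract_column(text: str, field_name: str) -> str:
    """Hypothetical stand-in for the LLM-backed function; not the PR's code."""
    # Looks for a "field_name: value" segment in a semicolon-separated string.
    for part in text.split(";"):
        key, _, value = part.partition(":")
        if key.strip().lower() == field_name.lower():
            return value.strip()
    return ""


# Assumption: the key is supplied through the OPENAI_API_KEY environment variable.
@unittest.skipIf(
    not os.environ.get("OPENAI_API_KEY"),
    "OPENAI_API_KEY is not set; skipping the LLM-backed test",
)
class ExtractColumnTest(unittest.TestCase):
    def test_extracts_single_field(self):
        row = "name: Ada; issue: login fails after update"
        self.assertEqual(
            fake_extract_column(row, "issue"), "login fails after update"
        )
```

With the skip condition at class level, CI runs without a key report the case as skipped rather than failed, while a local run with the key set exercises it.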

@hershd23
Contributor Author

Yes @xzdandy on it

Hersh Dhillon added 2 commits November 30, 2023 02:15
	modified:   evadb/functions/extract_column.py
	new file:   test/integration_tests/long/functions/test_extract_column.py
	modified:   tutorials/20-structured-data.ipynb
@hershd23
Contributor Author

Also, this is failing the linter check for a Colab notebook. Can you point me towards information on how to add that?

Collaborator

We need to skip this notebook in the notebook test below, since it requires an OpenAI key:

PYTHONPATH=./ python -m pytest --durations=5 --nbmake --overwrite "./tutorials" --capture=sys --tb=short -v --log-level=WARNING --nbmake-timeout=3000 --ignore="tutorials/08-chatgpt.ipynb" --ignore="tutorials/14-food-review-tone-analysis-and-response.ipynb" --ignore="tutorials/15-AI-powered-join.ipynb" --ignore="tutorials/16-homesale-forecasting.ipynb" --ignore="tutorials/17-home-rental-prediction.ipynb" --ignore="tutorials/18-stable-diffusion.ipynb" --ignore="tutorials/19-employee-classification-prediction.ipynb"

@xzdandy
Collaborator

xzdandy commented Dec 1, 2023

> Also this is failing the linter check for a Colab Notebook. Can you point me towards information on how to add that

Remove the last empty cell.

@hershd23
Contributor Author

hershd23 commented Dec 1, 2023

12-01-2023 17:31:12 [check_notebook_format:295] ERROR: ERROR: Notebook /Users/hershdhillon23/projects/evadb/script/formatting/../../tutorials/20-structured-data.ipynb does not contain correct Colab link -- update the link.

I do not have a Colab link right now.

@xzdandy
Collaborator

xzdandy commented Dec 2, 2023

> 12-01-2023 17:31:12 [check_notebook_format:295] ERROR: ERROR: Notebook /Users/hershdhillon23/projects/evadb/script/formatting/../../tutorials/20-structured-data.ipynb does not contain correct Colab link -- update the link.
>
> Do not have a Colab link right now

The current notebook does not actually work on Colab. I was trying to make it work yesterday, and I think it needs several modifications. One fix that can help: can you add EXTRACT_COLUMN to the bootstrap functions in https://github.com/georgia-tech-db/evadb/blob/staging/evadb/functions/function_bootstrap_queries.py?
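
For reference, a rough sketch of what registering the function at bootstrap might look like. It assumes the bootstrap file builds `CREATE FUNCTION IF NOT EXISTS ... IMPL ...` query strings from the installation directory and collects them in a list. `INSTALLATION_DIR`, `extract_column_function_query`, `bootstrap_queries`, and the registered name `EXTRACT_COLUMN` are placeholders here, so the actual file should be checked for the exact names.

```python
# Placeholder for the package installation directory (an assumption, not the real constant).
INSTALLATION_DIR = "evadb"

# Query that registers the function from the file added in this PR.
extract_column_function_query = (
    "CREATE FUNCTION IF NOT EXISTS EXTRACT_COLUMN "
    f"IMPL '{INSTALLATION_DIR}/functions/extract_column.py';"
)

# Hypothetical list of queries executed at startup.
bootstrap_queries = [extract_column_function_query]
```

Registering the function this way would let the notebook drop its own `CREATE FUNCTION` step, which is one less thing to break on Colab.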

@pchunduri6
Contributor

Should we perform this operation using ChatGPT directly, or use something like pandasAI to have an LLM write a function and then extract the column we need with it? Writing a function is much cheaper in token cost, but less robust.
@hershd23 @xzdandy Any thoughts?

@xzdandy
Collaborator

xzdandy commented Dec 4, 2023

> Should we perform this operation using ChatGPT directly or use something like pandasAI to write a function using LLM and then extract the column we need? Writing a function is much cheaper token cost-wise, but less robust.
> @hershd23 @xzdandy Any thoughts?

Hi @pchunduri6, I think it depends on the task. If the column to extract follows a pattern, I think we can generate a regex to save cost and improve efficiency. On the other hand, if the task is semantic, we need to rely on the LLM to extract the information.
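
To make the trade-off concrete, here is a small sketch of that split: try a regex when a pattern is available, and fall back to an LLM only for semantic fields. The helper `extract_column`, its parameters, and the stubbed `llm_extract` callable are illustrative assumptions, not the interface of the function in this PR.

```python
import re
from typing import Callable, Optional


def extract_column(
    text: str,
    pattern: Optional[str] = None,
    llm_extract: Optional[Callable[[str], str]] = None,
) -> Optional[str]:
    """Try a cheap regex first; fall back to an LLM callable for semantic fields."""
    if pattern is not None:
        match = re.search(pattern, text)
        if match:
            # Use the first capture group when the pattern defines one.
            return match.group(1) if match.groups() else match.group(0)
    if llm_extract is not None:
        return llm_extract(text)
    return None


# Pattern-based field: no LLM call needed.
order_id = extract_column("Order #48213 arrived damaged", pattern=r"#(\d+)")

# Semantic field: a stub stands in for the real LLM call.
issue = extract_column(
    "Order #48213 arrived damaged",
    llm_extract=lambda text: "damaged on arrival",
)
```

The regex path costs no tokens per row, so the LLM is only paid for when no pattern is given or the pattern fails to match.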

Labels
AI Engines Features, Bugs, related to AI Engines High Effort 🏋 Difficult solution or problem to solve
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

Introduce EXTRACT_COLUMNS to extract structured tables from unstructured text
3 participants